Optimize 1-bit estimator tail path #49
# Tail-pad report

## 1. What problem does tail-pad address?

The 1-bit estimator is built around an AVX-512 512-dimensional path. When the dimension is not a multiple of 512, the final partial block cannot go through that path directly.

## 2. Current tail-pad method

The current tail-pad method pads the remaining dimensions out to a full 512-dimensional block, so the tail stays on the same AVX-512 path as the main body. In other words, the tail is zero-padded up to the next multiple of 512.
This keeps the implementation simple and improves utilization of the AVX-512 path for dimensions that are not 512-aligned.

## 3. Single-function results

The following table reports median per-call latency for the single-function benchmark (
Overall, the gain is small when the dimension is already 512-aligned, but becomes significant when the remaining tail is large. The largest improvements appear when |
This PR introduces a tail-padded AVX-512 path:
Full 512-dimensional blocks are processed with 512-bit SIMD loads and popcount, while the final partial block is stored compactly and handled with masked AVX-512 loads. This avoids falling back to scalar-style tail processing and keeps the tail path vectorized.
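The masked-tail idea can be sketched in portable scalar C. The PR itself uses AVX-512 masked loads and vector popcount; the sketch below emulates the same logic word-by-word (`hamming_tail_masked` is a hypothetical name, and `__builtin_popcountll` is a GCC/Clang builtin assumed here for brevity):

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar emulation of the tail-masked distance idea. Vectors are
 * bit-packed into 64-bit words; `dim` is the number of valid bits. */
static unsigned hamming_tail_masked(const uint64_t *a, const uint64_t *b,
                                    size_t dim) {
    size_t full = dim / 64;   /* full 64-bit words (the "full blocks") */
    size_t rem  = dim % 64;   /* valid bits in the final partial word */
    unsigned dist = 0;
    for (size_t i = 0; i < full; i++)
        dist += (unsigned)__builtin_popcountll(a[i] ^ b[i]);
    if (rem) {
        /* Keep only the valid tail bits -- the scalar analogue of a
         * masked AVX-512 load that ignores lanes past the dimension. */
        uint64_t mask = (UINT64_C(1) << rem) - 1;
        dist += (unsigned)__builtin_popcountll((a[full] ^ b[full]) & mask);
    }
    return dist;
}
```

The mask plays the same role as the AVX-512 load mask: bits past `dim` never contribute to the popcount, so the compact tail storage needs no zero-padding.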